When Ignorance is Bliss 



Peter D. Grünwald

CWI, P.O. Box 94079 
1090 GB Amsterdam 
pdg@cwi.nl 
http://www.grunwald.nl

Abstract 

It is commonly-accepted wisdom that more in- 
formation is better, and that information should 
never be ignored. Here we argue, using both 
a Bayesian and a non-Bayesian analysis, that in 
some situations you are better off ignoring infor- 
mation if your uncertainty is represented by a set 
of probability measures. These include situations 
in which the information is relevant for the pre- 
diction task at hand. In the non-Bayesian anal- 
ysis, we show how ignoring information avoids 
dilation, the phenomenon that additional pieces 
of information sometimes lead to an increase in 
uncertainty. In the Bayesian analysis, we show 
that for small sample sizes and certain predic- 
tion tasks, the Bayesian posterior based on a non- 
informative prior yields worse predictions than 
simply ignoring the given information. 



1 INTRODUCTION 

It is commonly-accepted wisdom that more information is 
better, and that information should never be ignored. In- 
deed, this has been formalized in a number of ways in 
a Bayesian framework, where uncertainty is represented 
by a probability measure [Good 1967; Raiffa and Schlaifer
1961]. In this paper, we argue that occasionally you are
better off ignoring information if your uncertainty is repre- 
sented by a set of probability measures. Related observa- 
tions have been made by Seidenfeld [2004]; we compare 
our work to his in Section 5. 

For definiteness, we focus on a relatively simple setting.
Let $X$ be a random variable taking values in some set $\mathcal{X}$,
and let $Y$ be a random variable taking values in some set
$\mathcal{Y}$. The goal of an agent is to choose an action whose utility
depends only on the value of $Y$, after having observed the
value of $X$. We further assume that, before making the
observation, the agent has a prior $\Pr_Y$ on the value of $Y$. If
the agent actually had a prior on the joint distribution of $X$
and $Y$, then the obvious thing to do would be to condition
on the observation, to get the best estimate of the value
of $Y$. But we are interested in situations where the agent
does not have a single prior on the joint distribution, but a
family of priors $\mathcal{P}$. As the following example shows, this
is a situation that arises often.

Example 1.1: Consider a doctor who is trying to decide if
a patient has the flu or tuberculosis. The doctor then learns
the patient's address. The doctor knows that the patient's
address may be correlated with disease (tuberculosis may
be more prevalent in some parts of the city than others),
but does not know the correlation (if any) at all. In this
case, the random variable $Y$ is the disease that the patient
has, $\mathcal{Y} = \{\text{flu}, \text{tuberculosis}\}$, and $X$ is the
neighborhood in which the patient lives. The doctor is trying to
choose a treatment. The effect of the treatment depends only on
the value of $Y$. Under these circumstances, many doctors would
simply not take the patient's address into account, thereby
ignoring relevant information. In this paper we show that this
commonly-adopted strategy is often quite sensible. ∎

There is a relatively obvious sense in which ignoring information
is the right thing to do. Let $\mathcal{P}$ be the set of all joint
distributions on $\mathcal{X} \times \mathcal{Y}$ whose marginal on $Y$ is
$\Pr_Y$. $\mathcal{P}$ represents the set of distributions compatible with
the agent's knowledge. Roughly speaking, if $a^*$ is the best action
given just the prior $\Pr_Y$, then $a^*$ gives the same payoff for
all joint distributions $\Pr \in \mathcal{P}$ (since they all have marginal
$\Pr_Y$). We can show that every other action $a'$ will do worse
than $a^*$ against some joint distribution $\Pr \in \mathcal{P}$. Therefore,
ignoring the information leads one to adopt the minimax
optimal decision. This idea is formalized as Proposition 2.1
in Section 2, where we also show that ignoring information
compares very favorably to the "obvious" way of updating
the set of measures $\mathcal{P}$. Proposition 2.1 makes three
important assumptions:

1. There is no (second-order) distribution on the set of
probabilities $\mathcal{P}$.

2. $\mathcal{P}$ contains all probability distributions on
$\mathcal{X} \times \mathcal{Y}$ whose marginal is $\Pr_Y$.

3. The "goodness" of an action $a$ is measured by some
loss or utility that (although it may be unknown to the
agent at the time of updating) is fixed. In particular,
it does not depend on the observed value of $X$.

In the remainder of the paper, we investigate the effect of
dropping these assumptions. In Section 3, we consider
what happens if we assume some probability distribution
on the set $\mathcal{P}$ of probabilities. The obvious question is
which one to use. We have to distinguish between purely
subjective Bayesian approaches and so-called "noninformative",
"pragmatic", or "objective" Bayesian approaches
[Bernardo and Smith 1994], which are based on adopting
so-called "noninformative priors". We show that for
a large class of such priors, including the uniform
distribution and Jeffreys' prior, using the Bayesian posterior
may lead to worse decisions than using the prior $\Pr_Y$; that
is, we may be better off ignoring information rather than
conditioning on a noninformative prior; see Examples 3.1
and 3.2. In these examples, the posterior is based on a
relatively small sample. Of course, as the sample grows
larger, using any reasonable prior will result in a posterior
that converges to the true distribution. This follows
directly from standard Bayesian consistency theorems
[Ghosal 1998].

In Section 4 we investigate the effect of dropping the second
and third assumptions. We show that once there is partial
information about the relationship between $X$ and $Y$
(so that $\mathcal{P}$ is a strict subset of the set of all probability
distributions on $\mathcal{X} \times \mathcal{Y}$ whose marginal is $\Pr_Y$), then
the right thing to do becomes sensitive to the kind of "bookie" or
"adversary" that the agent can be viewed as playing against
(cf. [Halpern and Tuttle 1993]). We consider some related
work, particularly that of Seidenfeld [2004], in Section 5.
Our focus in this paper is on optimality in the minimax
sense. It is not clear that this is the most appropriate notion
of optimality. Indeed, Seidenfeld explicitly argues that it
is not, and the analysis in Section 4 suggests that there are
situations when ignoring information is a reasonable thing
to do, even though this is not the minimax approach. We
discuss alternative notions of optimality in Section 5. We
conclude with further discussion in Section 6.

2 WHEN IGNORING HELPS: A 
NON-BAYESIAN ANALYSIS 

In this section, we formalize our problem in a non-Bayesian
setting. We then show that, in this setting, under some pragmatic
assumptions, ignoring information is a sensible strategy.
We also show that ignoring information compares favorably
to the standard approach of working with sets of
measures on $\mathcal{X} \times \mathcal{Y}$.



As we said, we are interested in an agent who must choose
some action from a set $\mathcal{A}$, where the loss of the action
depends only on the value of a random variable $Y$, which
takes values in $\mathcal{Y}$. We assume that with each action $a \in \mathcal{A}$
and value $y \in \mathcal{Y}$ is associated some loss to the agent.
(The losses can be negative, which amounts to a gain.) Let
$L : \mathcal{Y} \times \mathcal{A} \to \mathbb{R} \cup \{\infty\}$ be the loss function.^1 For ease of
exposition, we assume in this paper that $\mathcal{A}$ is finite.

For every action $a \in \mathcal{A}$, let $L_a$ be the random variable on $\mathcal{Y}$
such that $L_a(y) = L(y, a)$. Since $\mathcal{A}$ is assumed to be finite,
for every distribution $\Pr_Y$ on $\mathcal{Y}$, there is a (not necessarily
unique) action $a^* \in \mathcal{A}$ that achieves minimum expected
loss, that is,

$\inf_{a \in \mathcal{A}} E_{\Pr_Y}[L_a] = E_{\Pr_Y}[L_{a^*}]$.    (1)

If all the agent knows is $\Pr_Y$, then it seems reasonable for
the agent to choose an action $a^*$ that minimizes expected
loss. We call such an action $a^*$ an optimal action for $\Pr_Y$.

Suppose that the agent observes the value of a variable $X$
that takes on values in $\mathcal{X}$. Further assume that, although
the agent knows the marginal distribution $\Pr_Y$ of $Y$, she
does not know how $Y$ depends on $X$. That is, the agent's
uncertainty is characterized by the set $\mathcal{P}$ consisting of all
distributions on $\mathcal{X} \times \mathcal{Y}$ with marginal distribution $\Pr_Y$ on
$\mathcal{Y}$. The agent now must choose a decision rule that determines
what she does as a function of her observations. We
allow decision rules to be randomized. Thus, if $\Delta(\mathcal{A})$ consists
of all probability distributions on $\mathcal{A}$, a decision rule is
a function $\delta : \mathcal{X} \to \Delta(\mathcal{A})$ that chooses a distribution over
actions based on her observations. Let $\mathcal{D}(\mathcal{X}, \mathcal{A})$ be the set
of all such decision rules. A special case is a deterministic
decision rule, which assigns probability 1 to a particular
action. If $\delta$ is deterministic, we sometimes abuse notation
and write $\delta(x)$ for the action that is assigned probability
1 by the distribution $\delta(x)$. Given a decision rule $\delta$ and a
loss function $L$, let $L_\delta$ be the random variable on $\mathcal{X} \times \mathcal{Y}$
such that $L_\delta(x, y) = \sum_{a \in \mathcal{A}} \delta(x)(a) L(y, a)$. Here $\delta(x)(a)$
stands for the probability of performing action $a$ according
to the distribution $\delta(x)$ over actions that is adopted when $x$
is observed. Note that in the special case that $\delta$ is a
deterministic decision rule, $L_\delta(x, y) = L(y, \delta(x))$. Moreover,
if $\delta_a$ is the (deterministic) decision rule that always
chooses $a$, then $L_{\delta_a}(x, y) = L_a(y)$.

The following result, whose proof we leave to the full paper,
shows that the decision rule $\delta^*$ that always chooses an
optimal action $a^*$ for $\Pr_Y$, independent of the observation,
is optimal in a minimax sense. Note that the worst-case expected
loss of decision rule $\delta$ is $\sup_{\Pr \in \mathcal{P}} E_{\Pr}[L_\delta]$. Thus,
the best worst-case loss (i.e., the minimax loss) over all decision
rules is $\inf_{\delta \in \mathcal{D}(\mathcal{X}, \mathcal{A})} \sup_{\Pr \in \mathcal{P}} E_{\Pr}[L_\delta]$.

^1 We could equally well use utilities, which can be viewed as
a positive measure of gain. Losses seem to be somewhat more
standard in this literature.



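Proposition 2.1: Suppose that $\Pr_Y$ is an arbitrary distribution
on $\mathcal{Y}$, $L$ is an arbitrary loss function, $\mathcal{P}$ consists of
all distributions on $\mathcal{X} \times \mathcal{Y}$ with marginal $\Pr_Y$, and $a^*$ is
an optimal action for $\Pr_Y$ (with respect to the loss function
$L$). Then $E_{\Pr_Y}[L_{a^*}] = \min_{\delta \in \mathcal{D}(\mathcal{X}, \mathcal{A})} \sup_{\Pr \in \mathcal{P}} E_{\Pr}[L_\delta]$.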

A standard decision rule when uncertainty is represented
by a set $\mathcal{P}$ of probability measures is the Maxmin Expected
Utility Rule [Gilboa and Schmeidler 1989]: compute the
expected utility (or expected loss) of an action with respect
to each of the probability measures in $\mathcal{P}$, and then
choose the action whose worst-case expected utility is best
(or worst-case expected loss is least). Proposition 2.1 says
that if $\mathcal{P}$ consists of all probability measures with marginal
$\Pr_Y$ and the loss depends only on the value of $Y$, then the
action with the least worst-case loss is an optimal action
with respect to $\Pr_Y$.

Example 2.2: Consider perhaps the simplest case, where
$\mathcal{X} = \mathcal{Y} = \{0, 1\}$. Suppose that our agent knows that
$E_{\Pr_Y}[Y] = \Pr_Y(Y = 1) = p$ for some fixed $p$. As before,
let $\mathcal{P}$ be the set of distributions on $\mathcal{X} \times \mathcal{Y}$ with marginal
$\Pr_Y$. Suppose further that the only actions are 0 and 1
(intuitively, these actions amount to predicting the value of
$Y$), and that the loss function is 0 if the right value is
predicted and 1 otherwise; that is, $L(i, j) = |i - j|$. This is
the so-called 0/1 or classification loss. It is easy to see that
$E[L_0] = p$ and $E[L_1] = 1 - p$, so the optimal act is to
choose 0 if $p < .5$ and 1 if $p > .5$ (both acts have loss 1/2
if $p = .5$). The loss of the optimal act is $\min\{p, 1 - p\}$.
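The expected losses in this example can be checked with a few lines of Python. This is only an illustrative sketch: the function names `expected_zero_one_losses` and `optimal_act` are hypothetical, and exact fractions are used purely to avoid rounding.

```python
from fractions import Fraction


def expected_zero_one_losses(p):
    """Return (E[L_0], E[L_1]) when Pr_Y(Y = 1) = p and L(y, a) = |y - a|."""
    loss_predict_0 = p * abs(1 - 0) + (1 - p) * abs(0 - 0)  # wrong only if Y = 1
    loss_predict_1 = p * abs(1 - 1) + (1 - p) * abs(0 - 1)  # wrong only if Y = 0
    return loss_predict_0, loss_predict_1


def optimal_act(p):
    """Return (act, expected loss) for an act minimizing expected 0/1 loss.

    Ties (p = 1/2) are broken in favor of act 0; both acts are optimal then.
    """
    loss_0, loss_1 = expected_zero_one_losses(p)
    return (0, loss_0) if loss_0 <= loss_1 else (1, loss_1)


if __name__ == "__main__":
    for p in (Fraction(1, 5), Fraction(1, 2), Fraction(4, 5)):
        print(p, optimal_act(p))
```

In every case the second component of `optimal_act(p)` equals `min(p, 1 - p)`.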

Perhaps the more standard approach for dealing with uncertainty
in this case is to work with the whole set of distributions.
Assume that $0 < \Pr(Y = 1) = p < 1$. Let
$\mathcal{P}_i = \{\Pr(\cdot \mid X = i) : \Pr \in \mathcal{P}\}$, $i = 0, 1$. Then for all
$q \in [0, 1]$, both $\mathcal{P}_0$ and $\mathcal{P}_1$ contain a distribution $\Pr_q$ such
that $\Pr_q(Y = 1) = q$. In other words, $\mathcal{P}_0 = \mathcal{P}_1 = \Delta(\mathcal{Y})$,
the set of all distributions on $\mathcal{Y}$. Observing $X = x$ causes
all information about $Y$ to be lost. Remarkably, this holds
no matter what value of $X$ is observed.

Thus, even though the agent knew that $\Pr(Y = 1) = p$
before observing $X$, after observing $X$, the agent has no
idea of the probability that $Y = 1$. This is a special case
of a phenomenon that has been called dilation in the statistical
and imprecise probability literature [Augustin 2003;
Cozman and Walley 2001; Herron, Seidenfeld, and Wasserman
1997; Seidenfeld and Wasserman 1993]: it is possible
that lower probabilities strictly decrease and that upper
probabilities strictly increase, no matter what value of $x$ is
observed. Dilation has severe consequences for decision-making.
The minimax-optimal decision rule $\delta^*$ with respect
to $\mathcal{P}_x$ is to randomize, choosing both 0 and 1 with
probability 1/2. Note that, no matter what $\Pr \in \mathcal{P}$ actually
obtains,

$E_{\Pr}[L_{a^*}] = \min\{p, 1 - p\}$;    $E_{\Pr}[L_{\delta^*}] = 1/2$.



Thus, if $p$ is close to 0 or 1, ignoring information does much
better than making use of it.
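The comparison can be reproduced numerically. The sketch below is illustrative only: it parameterizes each joint distribution with marginal $p$ by $\alpha = \Pr(X = 1 \mid Y = 1)$ and $\beta = \Pr(X = 1 \mid Y = 0)$ (an assumption of this sketch at this point in the text), and relies on the fact that the expected loss of a fixed rule is affine in $(\alpha, \beta)$, so its supremum over the square $[0, 1]^2$ is attained at a corner. All function and variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def joint(p, alpha, beta):
    """Joint distribution on (x, y) with Pr(Y=1) = p,
    alpha = Pr(X=1 | Y=1) and beta = Pr(X=1 | Y=0)."""
    return {
        (1, 1): p * alpha,
        (1, 0): (1 - p) * beta,
        (0, 1): p * (1 - alpha),
        (0, 0): (1 - p) * (1 - beta),
    }


def expected_loss(rule, p, alpha, beta):
    """Expected 0/1 loss of a rule, where rule[x] is the probability
    of predicting 1 after observing X = x."""
    total = Fraction(0)
    for (x, y), prob in joint(p, alpha, beta).items():
        q = rule[x]
        total += prob * (q * abs(y - 1) + (1 - q) * abs(y - 0))
    return total


def worst_case_loss(rule, p):
    """Supremum of the expected loss over all joints with marginal p.
    The loss is affine in (alpha, beta), so the four corners suffice."""
    return max(expected_loss(rule, p, a, b)
               for a, b in product((0, 1), repeat=2))


p = Fraction(1, 3)
always_zero = {0: 0, 1: 0}                          # ignore X, predict 0
coin_flip = {0: Fraction(1, 2), 1: Fraction(1, 2)}  # randomize after any X
follow_x = {0: 0, 1: 1}                             # predict Y = X

for name, rule in [("ignore", always_zero), ("randomize", coin_flip),
                   ("follow X", follow_x)]:
    print(name, worst_case_loss(rule, p))
```

For $p = 1/3$ the rule that ignores $X$ has worst-case loss $p$, the randomizing rule has loss 1/2 against every joint distribution, and a rule that trusts $X$ can be made to err with probability 1.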

This can be viewed as an example of what decision theorists
have called time inconsistency. Suppose, for definiteness,
that $p = 1/3$. Then, a priori, the optimal strategy is
to decide 0 no matter what. On the other hand, if either
$X = 0$ or $X = 1$ is observed, then the optimal action is
to randomize. When uncertainty is described with a single
probability distribution (and updating is done by conditioning),
then time inconsistency cannot occur.^2 ∎

3 WHEN IGNORING HELPS: A 
BAYESIAN ANALYSIS 

Suppose that, instead of having just a set $\mathcal{P}$ of probability
measures, the agent has a probability measure on $\mathcal{P}$. But
then which probability measure should she take? Broadly
speaking, there are two possibilities here. We can consider
either purely subjective Bayesian agents or pragmatic
Bayesian agents. A purely subjective Bayesian agent will
come up with some (arbitrary) prior that expresses her
subjective beliefs about the situation. It then makes sense to
assess the consequences of ignoring information in terms of
expected loss, where the expectation is taken with respect
to the agent's subjective prior. Good's total evidence theorem,
a classical result of Bayesian decision theory [Good
1967; Raiffa and Schlaifer 1961], states that, when taking
the expectation with respect to the agent's prior, the optimal
decision should always be based on conditioning on
all the available information: information should never be
ignored.

In contrast, we consider an agent who adopts Bayesian updating
for pragmatic reasons (i.e., because it usually works
well) rather than for fundamental reasons. In this case,
because computation time is limited and/or prior knowledge
is hard to obtain or formulate, the prior adopted is typically
easily computable and "noninformative", such as a prior
that is uniform in some natural parameterization of $\mathcal{P}$. We
suspect that many statisticians are pragmatic Bayesians in
this sense. (Indeed, most "Bayesian" UAI and statistics papers
adopt pragmatic priors that cannot seriously be viewed
as fully subjective.) When analyzing such a pragmatic approach,
it no longer makes that much sense to compare
ignoring information to Bayesian updating on new information
by looking at the expected loss with respect to the
adopted prior. The reason is that the prior can no longer
be expected to correctly reflect the agent's degrees of belief.
It seems more meaningful to pick a single probability
measure $\Pr$ and to analyze the behavior of the Bayesian under
the assumption that $\Pr$ is the "true" state of nature. By
varying $\Pr$ over the set $\mathcal{P}$, we can get a sense of the behavior
of Bayesian updating in all possible situations. This is
the type of analysis that we adopt in this section; it is quite
standard in the statistical literature on consistency of Bayes
methods [Blackwell and Dubins 1962; Ghosal 1998].

^2 It has been claimed that this time consistency also depends
on the agent having perfect recall; see [Halpern 1997; Piccione
and Rubinstein 1997] for some discussion of this issue.

We focus on a large class of priors on $\mathcal{P}$ that includes most
standard recommendations for noninformative priors.
Essentially, we show that for any prior in the class, when the
sample size is small, ignoring information is better than using
the Bayesian posterior. That is, if a pragmatic agent
has the choice between (a) first adopting a pragmatic prior,
perhaps not correctly reflecting her own beliefs, and then
reasoning like a Bayesian, or (b) simply ignoring the available
information, then, when the sample size is small, she
might prefer option (b). On the other hand, as more information
becomes available, the Bayesian posterior behaves
almost as well as ignoring the information in the worst case,
and substantially better than ignoring in most other cases.
(Of course, part of the issue here is what counts as "better"
when uncertainty is represented by a set of probability
measures. For the time being, we say that "A is better than
B" if A achieves better minimax behavior than B. We return
to this issue at the end of this section.)

Example 3.1: As in Example 2.2, let $\mathcal{X} = \mathcal{Y} = \{0, 1\}$. For
definiteness, suppose that the known prior $\Pr_Y$ is such that
$\Pr_Y(Y = 1) = p$. Throughout this section we assume that
$0 < p < 1$. A probability measure on $\mathcal{X} \times \mathcal{Y}$ is completely
determined by $\Pr(X = 1 \mid Y = 1)$ and $\Pr(X = 1 \mid Y = 0)$.
Moreover, for every choice $(\alpha, \beta) \in [0, 1] \times [0, 1]$ for
these two conditional probabilities, there is a probability
$\Pr_{\alpha,\beta} \in \mathcal{P}$; in fact

$\Pr_{\alpha,\beta}(X = 1, Y = 1) = p\alpha$;
$\Pr_{\alpha,\beta}(X = 1, Y = 0) = (1 - p)\beta$;
$\Pr_{\alpha,\beta}(X = 0, Y = 1) = p(1 - \alpha)$;
$\Pr_{\alpha,\beta}(X = 0, Y = 0) = (1 - p)(1 - \beta)$.

Notice that $\Pr_{\alpha,\beta}(X = 1) = p\alpha + (1 - p)\beta$. Given this,
one obvious way to put a uniform prior on $\mathcal{P}$ is just to take
a uniform prior on the square $[0, 1]^2$; we adopt this prior
for the time being and consider other notions of "uniform"
further below.

To calculate the Bayesian predictions of $Y$ given $X$,
we must first determine the Bayesian "marginal" probability
measure $\overline{\Pr}$, where
$\overline{\Pr}(X = i, Y = j) = \int_{\alpha=0}^{1} \int_{\beta=0}^{1} \Pr_{\alpha,\beta}(X = i, Y = j) \, d\alpha \, d\beta$
("marginal" because we are marginalizing out the parameters $\alpha$ and $\beta$),
and then use $\overline{\Pr}$ to calculate the expected loss of predicting
$Y = 1$. That is, we calculate the so-called "predictive
distribution" $\overline{\Pr}(Y = \cdot \mid X = \cdot)$. We can calculate this
directly without performing any integration as follows. By
symmetry, we must have $\overline{\Pr}(Y = 1 \mid X = 1) = \overline{\Pr}(Y = 1 \mid X = 0)$.
Now if $\gamma = \overline{\Pr}(X = 1)$, then it must be the
case that $\gamma \overline{\Pr}(Y = 1 \mid X = 1) + (1 - \gamma) \overline{\Pr}(Y = 1 \mid X = 0) = p$;
this implies that $\overline{\Pr}(Y = 1 \mid X = 1) = p$. Thus,
when calculating the predictive distribution of $Y$ after observing
$X$, the Bayesian will always ignore the value of $X$
and predict with his marginal distribution $\Pr_Y$.

Thus, before observing data, the Bayesian ignores the value
of $X$, and thus makes minimax-optimal decisions. Potentially
suboptimal behavior of the Bayesian can occur only
after the Bayesian has observed some data. To analyze this
case, we need to assume that we have a sequence of $n$ observations
$(X_1, Y_1), \ldots, (X_n, Y_n)$ and are trying to predict
the value of $Y_{n+1}$, given the value of $X_{n+1}$. The distribution
$\Pr_{\alpha,\beta}$ on $\mathcal{X} \times \mathcal{Y}$ is extended to a distribution $\Pr^n_{\alpha,\beta}$
on $(\mathcal{X} \times \mathcal{Y})^n$ by assuming that the observations are independent.
Of course, the hope is that the observations will
help us learn about $\alpha$ and $\beta$, allowing us to make better
decisions on $Y$. To take the simplest case, suppose that we
have observed that $(X_1, Y_1) = (1, 1)$ and $X_2 = 1$, and
want to predict the value of $Y_2$. Note that

$\overline{\Pr}^2(Y_2 = 1 \mid X_2 = 1, (X_1, Y_1) = (1, 1))$
$= \frac{\overline{\Pr}^2((X_2, Y_2) = (1, 1), (X_1, Y_1) = (1, 1))}{\sum_{y \in \{0,1\}} \overline{\Pr}^2((X_2, Y_2) = (1, y), (X_1, Y_1) = (1, 1))}$
$= \frac{p^2/3}{p^2/3 + p(1 - p)/4}$
$= \frac{4p}{p + 3}$.

Since $0 < p < 1$, this value differs from $p$. Since we must have
$\overline{\Pr}^2(Y_2 = 1 \mid (X_1, Y_1) = (1, 1)) = p$, it follows using the same
averaging argument as above that if $(X_1, Y_1) = (1, 1)$, then, no matter
what value of $X_2$ is observed, the value of $X_2$ is not ignored. Similar
calculations show that, if $(X_1, Y_1) = (i, j)$ for any $i, j \in \{0, 1\}$,
then, no matter what value of $X_2$ is observed, the
value of $X_2$ is not ignored. ∎
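The predictive probabilities in Example 3.1 can be verified exactly. The following sketch assumes the uniform prior on $(\alpha, \beta) \in [0, 1]^2$ from the example and uses the standard Beta integral $\int_0^1 t^a (1 - t)^b \, dt = a! \, b! / (a + b + 1)!$ for nonnegative integers $a, b$; the function names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from math import factorial


def beta_integral(a, b):
    """Exact integral of t**a * (1 - t)**b over [0, 1], integers a, b >= 0."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def marginal_likelihood(p, observations):
    """Bayesian marginal probability of a list of (x, y) pairs under the
    uniform prior on alpha = Pr(X=1 | Y=1) and beta = Pr(X=1 | Y=0)."""
    n1 = sum(1 for _, y in observations if y == 1)
    n0 = len(observations) - n1
    x1_y1 = sum(1 for x, y in observations if (x, y) == (1, 1))
    x1_y0 = sum(1 for x, y in observations if (x, y) == (1, 0))
    return (p ** n1 * (1 - p) ** n0
            * beta_integral(x1_y1, n1 - x1_y1)    # integral over alpha
            * beta_integral(x1_y0, n0 - x1_y0))   # integral over beta


def predictive(p, data, x_new):
    """Predictive probability that the next Y is 1, given past data
    (a list of (x, y) pairs) and the next observation X = x_new."""
    numerator = marginal_likelihood(p, data + [(x_new, 1)])
    denominator = numerator + marginal_likelihood(p, data + [(x_new, 0)])
    return numerator / denominator


if __name__ == "__main__":
    p = Fraction(1, 3)
    print(predictive(p, [], 1), predictive(p, [], 0))   # no data yet
    print(predictive(p, [(1, 1)], 1), 4 * p / (p + 3))  # after one observation
```

With no data the predictive probability equals $p$ for either value of $X$; after observing $(X_1, Y_1) = (1, 1)$ it equals $4p/(p + 3)$ when $X_2 = 1$, in agreement with the calculation above.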

In Example 3.1 we claimed that the Bayesian should predict
by the predictive distribution $\overline{\Pr}(Y = 1 \mid X = 1)$ as
defined in the example. While this is the standard Bayesian
approach, one may also directly consider the "expected"
conditional probability $\int_{\alpha=0}^{1} \int_{\beta=0}^{1} \Pr_{\alpha,\beta}(Y = 1 \mid X = 1) \, d\alpha \, d\beta$.
These two approaches give different answers, since expectation
does not commute with division. To see why we
prefer the standard approach, note that, because of independence,
$\Pr^2_{\alpha,\beta}(Y_2 = i \mid X_2 = j, (X_1, Y_1) = (i', j')) = \Pr_{\alpha,\beta}(Y_2 = i \mid X_2 = j)$;
similarly with repeated observations. That is, with the
alternative approach, there would be no learning from data.
Thus, for the remainder of the paper, we use the predictive-distribution
approach, with no further comment.

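Example 3.2: Now consider the more general situation
where $\mathcal{X} = \{1, \ldots, M\}$ for arbitrary $M$, and $\mathcal{Y} = \{0, 1\}$
as before. We consider a straightforward extension of the
previous set of distributions: let $\vec\alpha = (\alpha_1, \ldots, \alpha_M)$ be an
element of the $M$-dimensional unit simplex; $\vec\beta$ is defined
similarly. Fix $p \in [0, 1]$, and define

$\Pr_{\vec\alpha,\vec\beta}(Y = 1) = p$;    $\Pr_{\vec\alpha,\vec\beta}(X = j \mid Y = 1) = \alpha_j$;
$\Pr_{\vec\alpha,\vec\beta}(X = j \mid Y = 0) = \beta_j$.

Note that

$\Pr_{\vec\alpha,\vec\beta}(X = j, Y = 1) = \alpha_j p$

and

$\Pr_{\vec\alpha,\vec\beta}(X = j, Y = 0) = \beta_j (1 - p)$.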

Let $D$ be a random variable used to denote the outcome of
the $n$ observations $(X_1, Y_1), \ldots, (X_n, Y_n)$. Given a sequence
$(\vec x, \vec y) = ((x_1, y_1), \ldots, (x_n, y_n))$ of observations,
let $n_k$ denote the number of observations in the sequence
with $Y_i = k$, for $k \in \{0, 1\}$. Similarly, $n_{(j,k)}$ denotes
the number of observations $(X_i, Y_i)$ in the sequence
with $(X_i = j, Y_i = k)$. Then

$\Pr^n_{\vec\alpha,\vec\beta}(D = (\vec x, \vec y)) = p^{n_1} (1 - p)^{n_0} \prod_{j=1}^{M} \alpha_j^{n_{(j,1)}} \prod_{j=1}^{M} \beta_j^{n_{(j,0)}}$.

We next put a prior on $\mathcal{P} = \{\Pr_{\vec\alpha,\vec\beta} : \vec\alpha, \vec\beta \in \Delta_M\}$. We
restrict attention to priors that can be written as a product
of Dirichlet distributions [Bernardo and Smith 1994].
A Dirichlet distribution on the $M$-dimensional unit simplex
$\Delta_M$ (which we can identify with the set of probability
distributions on $\{1, \ldots, M\}$) is parameterized by an
$M$-dimensional vector $\vec a$. For $\vec a = (a_1, \ldots, a_M)$, the
$\vec a$-Dirichlet distribution has density $p_{\vec a}$ that satisfies, for all
$\vec\alpha \in \Delta_M$,

$p_{\vec a}(\vec\alpha) = \frac{1}{Z(\vec a)} \alpha_1^{a_1 - 1} \cdot \ldots \cdot \alpha_M^{a_M - 1}$,

where $Z(\vec a) = \int_{\vec\alpha \in \Delta_M} \alpha_1^{a_1 - 1} \cdot \ldots \cdot \alpha_M^{a_M - 1} \, d\vec\alpha$ is a
normalizing factor. Note that the uniform prior is the $\vec a$-Dirichlet
prior where $a_1 = a_2 = \ldots = a_M = 1$. As we shall see,
many other priors of interest are special cases of Dirichlet
priors.
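As a small illustration of the Dirichlet density, the sketch below evaluates the normalizing factor through the standard closed form $Z(\vec a) = \prod_j \Gamma(a_j) / \Gamma(\sum_j a_j)$. This closed form is a well-known identity rather than something stated in the text, and the function names are hypothetical.

```python
from math import gamma, prod


def dirichlet_normalizer(a):
    """Normalizing factor Z(a) = prod_j Gamma(a_j) / Gamma(sum_j a_j)."""
    return prod(gamma(a_j) for a_j in a) / gamma(sum(a))


def dirichlet_density(alpha, a):
    """Density of the a-Dirichlet distribution at a point alpha of the simplex
    (alpha is a sequence of nonnegative numbers summing to 1)."""
    unnormalized = prod(x ** (a_j - 1) for x, a_j in zip(alpha, a))
    return unnormalized / dirichlet_normalizer(a)


if __name__ == "__main__":
    # Uniform prior (all parameters 1): constant density on the simplex.
    print(dirichlet_density([0.3, 0.7], [1, 1]))
    print(dirichlet_density([0.9, 0.1], [1, 1]))
    # All parameters 1/2: density is largest near the boundary.
    print(dirichlet_density([0.5, 0.5], [0.5, 0.5]))
    print(dirichlet_density([0.01, 0.99], [0.5, 0.5]))
```

For $M = 2$ and $a_1 = a_2 = 1$ the density is constant, which is the uniform prior on $[0, 1]$ used in Example 3.1 for each of $\alpha$ and $\beta$.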

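We consider only priors $w$ on $\mathcal{P}$ that satisfy $w(\vec\alpha, \vec\beta) =
w_{\vec a}(\vec\alpha) w_{\vec b}(\vec\beta)$ for all $\vec\alpha, \vec\beta \in \Delta_M$, where $w_{\vec a}$ and $w_{\vec b}$ are of
the Dirichlet form. Then

$\overline{\Pr}^n(D = (\vec x, \vec y)) = \int_{\vec\alpha \in \Delta_M} \int_{\vec\beta \in \Delta_M} w_{\vec a}(\vec\alpha) w_{\vec b}(\vec\beta) \Pr^n_{\vec\alpha,\vec\beta}(D = (\vec x, \vec y)) \, d\vec\alpha \, d\vec\beta$.    (2)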

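Now suppose that a Bayesian has observed an initial sample
$D$ of size $n$ and $X_{n+1}$, and must predict $Y_{n+1}$. Suppose
$X_{n+1} = k$. Then the Bayesian's predictive distribution becomes
$\overline{\Pr}^{n+1}(Y_{n+1} = \cdot \mid X_{n+1}, D)$ or, more explicitly,

$\overline{\Pr}^{n+1}(Y_{n+1} = j \mid X_{n+1} = k, D = (\vec x, \vec y))
= \frac{\overline{\Pr}^{n+1}(D = (\vec x, \vec y), X_{n+1} = k, Y_{n+1} = j)}{\overline{\Pr}^{n+1}(D = (\vec x, \vec y), X_{n+1} = k)}$.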



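It will be convenient to represent this distribution by the
odds ratio

$\frac{\overline{\Pr}^{n+1}(Y_{n+1} = 1 \mid X_{n+1} = k, D = (\vec x, \vec y))}{\overline{\Pr}^{n+1}(Y_{n+1} = 0 \mid X_{n+1} = k, D = (\vec x, \vec y))}
= \frac{\overline{\Pr}^{n+1}(D = (\vec x, \vec y), X_{n+1} = k, Y_{n+1} = 1)}{\overline{\Pr}^{n+1}(D = (\vec x, \vec y), X_{n+1} = k, Y_{n+1} = 0)}$.    (3)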

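Both the numerator and the denominator of the right-hand
side of (3) are of the form (2), so this expression is a ratio
of Dirichlet integrals. These can be calculated explicitly
[Bernardo and Smith 1994], giving

$\frac{\overline{\Pr}^{n+1}(Y_{n+1} = 1 \mid X_{n+1} = k, D = (\vec x, \vec y))}{\overline{\Pr}^{n+1}(Y_{n+1} = 0 \mid X_{n+1} = k, D = (\vec x, \vec y))}
= \frac{p}{1 - p} \cdot \frac{n_{(k,1)} + a_k}{n_{(k,0)} + b_k} \cdot \frac{n_0 + \sum_{j=1}^{M} b_j}{n_1 + \sum_{j=1}^{M} a_j}$.    (4)

With the uniform prior, (4) simplifies to

$\frac{p}{1 - p} \cdot \frac{n_{(k,1)} + 1}{n_{(k,0)} + 1} \cdot \frac{n_0 + M}{n_1 + M}$.    (5)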

(4) and (5) show that the odds ratio behaves like $p/(1 - p)$
(which would be the odds ratio obtained by ignoring the
values of $X$) times some "correction factor". Ideally this
correction factor would be close to 1 for small samples and
then smoothly change "in the right direction", so that the
Bayesian's predictions are never much worse than the minimax
predictions and, as more data comes in, get monotonically
better and better. We now consider two examples to
show the extent to which this happens.
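The "prior odds times correction factor" structure can be made concrete in code. The sketch below assumes the standard Dirichlet-multinomial result that the posterior mean of $\alpha_k$ is $(n_{(k,1)} + a_k)/(n_1 + \sum_j a_j)$, and likewise for $\beta_k$; the function name, the dictionary layout of the counts, and the use of 1-based values of $X$ are choices made here for illustration only.

```python
from fractions import Fraction


def predictive_odds(p, counts, k, a, b):
    """Predictive odds that Y_{n+1} = 1 versus 0 after observing X_{n+1} = k.

    p      -- known marginal Pr(Y = 1)
    counts -- dict mapping (j, y) to the number of observations with
              X = j and Y = y, for j in 1..M (missing keys count as 0)
    a, b   -- Dirichlet parameters for the priors on alpha and beta
              (sequences of length M; all ones gives the uniform prior)
    """
    M = len(a)
    n1 = sum(counts.get((j, 1), 0) for j in range(1, M + 1))
    n0 = sum(counts.get((j, 0), 0) for j in range(1, M + 1))
    prior_odds = Fraction(p) / (1 - Fraction(p))
    # Ratio of the unnormalized posterior weights of cell k under Y=1 and Y=0.
    cell_factor = (Fraction(counts.get((k, 1), 0) + a[k - 1])
                   / Fraction(counts.get((k, 0), 0) + b[k - 1]))
    # Ratio of the corresponding normalizers.
    total_factor = Fraction(n0 + sum(b)) / Fraction(n1 + sum(a))
    return prior_odds * cell_factor * total_factor


if __name__ == "__main__":
    p = Fraction(1, 3)
    uniform = [1, 1]
    print(predictive_odds(p, {}, 1, uniform, uniform))           # no data
    odds = predictive_odds(p, {(1, 1): 1}, 1, uniform, uniform)  # one (1, 1)
    print(odds, odds / (1 + odds))
```

With no data the correction factor is 1 and the odds equal $p/(1 - p)$. With $M = 2$, the uniform prior, and a single observation in cell $(1, 1)$, converting the odds to a probability gives $4p/(p + 3)$, matching Example 3.1.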

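First, take $M = 2$, and let $\Pr$ be such that $\Pr(Y = 1) = p$,
$\Pr(X = 1 \mid Y = 1) = 1$, and $\Pr(X = 0 \mid Y = 0) = 1$.
Then $n_{(1,0)} = 0$ and, for $k = 1$, (4) becomes

$\frac{\overline{\Pr}^{n+1}(Y_{n+1} = 1 \mid X_{n+1} = 1, D = (\vec x, \vec y))}{\overline{\Pr}^{n+1}(Y_{n+1} = 0 \mid X_{n+1} = 1, D = (\vec x, \vec y))}
= \frac{p}{1 - p} \cdot \frac{n_{(1,1)} + a_1}{b_1} \cdot \frac{n_0 + b_1 + b_2}{n_1 + a_1 + a_2}$.

For all but the smallest $n$, with high $\Pr$-probability,
$n_{(1,1)} \approx pn$. Thus, the odds ratio tends to infinity, as expected.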

In the previous example, $X$ and $Y$ were completely correlated.
Suppose that they are independent. That is, suppose
again that $M = 2$, but that $\Pr$ is such that $\Pr(Y = 1 \mid X = k) = p$,
for $k = 0, 1$. For simplicity, we further suppose
that $p = 1/2$ and that $\Pr(X = 0) = \Pr(X = 1) = 1/2$;
the same argument applies with little change if we drop
these assumptions.

Given $a > 1$, consider a loss function $L_a$ with asymmetric
misclassification costs, given by $L_a(0, 0) = L_a(1, 1) = 0$;
$L_a(1, 0) = 1$; $L_a(0, 1) = a$. Clearly, $E_{\Pr_Y}[L_0] = 0.5$ and
$E_{\Pr_Y}[L_1] = 0.5a$. Thus, the optimal action with respect
to the prior $\Pr_Y$ is to predict 0, and the minimax-optimal
decision rule is to always predict 0. Moreover, the expected
additional loss of predicting 1 rather than 0 is $.5(a - 1)$.
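The expected losses under the asymmetric loss can be checked directly. In this sketch the first argument of the loss is the true value of $Y$ and the second is the prediction, following the convention $L : \mathcal{Y} \times \mathcal{A}$ of Section 2; the function names are hypothetical.

```python
from fractions import Fraction


def asymmetric_loss(y, prediction, a):
    """Loss with cost 1 for predicting 0 when Y = 1 and cost a for
    predicting 1 when Y = 0; correct predictions cost nothing."""
    if y == prediction:
        return 0
    return 1 if (y, prediction) == (1, 0) else a


def expected_losses(p, a):
    """Return (E[L_0], E[L_1]) when Pr_Y(Y = 1) = p."""
    loss_0 = p * asymmetric_loss(1, 0, a) + (1 - p) * asymmetric_loss(0, 0, a)
    loss_1 = p * asymmetric_loss(1, 1, a) + (1 - p) * asymmetric_loss(0, 1, a)
    return loss_0, loss_1


if __name__ == "__main__":
    p, a = Fraction(1, 2), Fraction(7, 5)   # a = 1.4, the value used below
    loss_0, loss_1 = expected_losses(p, a)
    print(loss_0, loss_1, loss_1 - loss_0)
```

For $p = 1/2$ the gap between the two expected losses is $(a - 1)/2$, which is the extra loss incurred each time the Bayesian is led to predict 1.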

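Now consider the predictions of a Bayesian who uses the
uniform prior. The Bayesian will predict 1 iff

$\frac{E_{\overline{\Pr}(Y_{n+1} \mid X_{n+1}, D)}[L_1]}{E_{\overline{\Pr}(Y_{n+1} \mid X_{n+1}, D)}[L_0]}
= a \cdot \frac{\overline{\Pr}(Y_{n+1} = 0 \mid X_{n+1} = k, D = (\vec x, \vec y))}{\overline{\Pr}(Y_{n+1} = 1 \mid X_{n+1} = k, D = (\vec x, \vec y))} < 1$.

From the odds ratio (5) we see that this holds iff

$a < \frac{\overline{\Pr}(Y_{n+1} = 1 \mid X_{n+1} = k, D = (\vec x, \vec y))}{\overline{\Pr}(Y_{n+1} = 0 \mid X_{n+1} = k, D = (\vec x, \vec y))}$.    (6)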
If $\beta$ is the probability (with respect to $\Pr^n$) of (6), then
the difference between the Bayesian's expected loss and the
expected loss of someone who ignores the information is
$\beta(a - 1)/2$. Clearly, $\beta$ depends on $a$ and $n$. Moreover, for
any fixed $a > 1$, $\lim_{n \to \infty} \beta = 0$. This, of course, just says
that eventually the Bayesian will learn correctly. However,
for relatively small $n$, it is not hard to construct situations
where $\beta(a - 1)/2$ can be nontrivial. For example, if $n = 4$
and $a = 1.4$, then $\beta \approx .35$. (We computed this by a brute
force calculation, by considering all the values $n_{(j,k)}$ that
cause (6) to be true, and computing their probability.) Thus,
the Bayesian's expected loss is about 14% worse than that
of an agent who ignores the information.

Although in this example there is no dependence between
$X$ and $Y$ in the actual distribution, by continuity, the same
result holds if there is some dependence. ∎

This conclusion assumed that a Bayesian chose a particular
noninformative prior, but it does not depend strongly
on this choice. As is well known, there is no unique way
of defining a "uniform prior" on a set of distributions $\mathcal{P}$,
since what is "uniform" depends on the chosen parameterization.
For this reason, people have developed other
types of noninformative priors. One of the most well-known
of these is the so-called Jeffreys' prior [Jeffreys
1946; Bernardo and Smith 1994], specifically designed as
a prior expressing "ignorance". This prior is invariant under
continuous 1-to-1 reparameterizations of $\mathcal{P}$. It turns out
that Jeffreys' prior on the set $\mathcal{P}$ is also of the Dirichlet form
(with $a_1 = \ldots = a_M = b_1 = \ldots = b_M = 1/2$) so that it
satisfies (4) (see, for example, [Kontkanen, Myllymäki, Silander,
Tirri, and Grünwald 2000]). Other pragmatic priors
that are often used in practice are the so-called equivalent
sample size (ESS) priors [Kontkanen, Myllymäki, Silander,
Tirri, and Grünwald 2000]. For the case of our $\mathcal{P}$,
these also take the Dirichlet form. Thus, the analysis of
Example 3.2 does not substantially change if we use
Jeffreys' prior or an ESS prior. It remains the case that, for
certain sample sizes, ignoring information is preferable to
using the Bayesian posterior.



Example 3.2 shows that with noninformative priors, for
small sample sizes, ignoring the information may be better
than Bayesian updating. Essentially, the reason for this
is that all standard noninformative priors assign probability
zero to the set of distributions $\mathcal{P}' \subset \mathcal{P}$ according to which
$X$ and $Y$ are independent. But the measures in $\mathcal{P}'$ are exactly
the ones that lead to minimax-optimal decisions. Of
course, there is no reason that a Bayesian must use a
noninformative prior. In some settings it may be preferable to
adopt a "hierarchical pragmatic prior" that puts a uniform
probability on both $\mathcal{P} - \mathcal{P}'$ and $\mathcal{P}'$, and assigns probability
0.5 to each of $\mathcal{P} - \mathcal{P}'$ and $\mathcal{P}'$. Such a prior makes it easier
for a Bayesian to learn that $X$ and $Y$ are independent.
(A closely related prior has been used by Barron, Rissanen,
and Yu [1998], in the context of universal coding, with a
logarithmic loss function.) With such a prior, a Bayesian
would do better in this example.

The notion of optimality that we have used up to now is
minimax loss optimality. Prediction $i$ is better than prediction
$j$ if the worst-case loss when predicting $i$ (taken
over all possible priors in $\mathcal{P}$) is better than the worst-case
loss when predicting $j$. But there are certainly other quite
reasonable criteria that could be used when comparing predictions.
In particular, we could consider minimax regret.
That is, we could consider the prediction that minimizes
the worst-case difference between the best prediction for
each $\Pr \in \mathcal{P}$ and the actual prediction. In the second half
of Example 3.2, we calculated that if the true probability
$\Pr$ is such that $\Pr(X = 0) = \Pr(X = 1) = 1/2$ and
$\Pr$ makes $X$ and $Y$ independent, then the difference between
the loss incurred by an agent that ignores the information
and a Bayesian is roughly .07. We do not know if there
are probabilities $\Pr'$ for which the Bayesian agent does
much worse than an agent who ignores the information with
respect to $\Pr'$. On the other hand, if $X$ and $Y$ are completely
correlated, that is, if the true probability $\Pr$ is such that
$\Pr(Y = 1 \mid X = 1) = \Pr(Y = 0 \mid X = 0) = 1$, then
if $n = 4$ and $a = 1.4$, the Bayesian will predict correctly,
while half the time the agent that ignores information will
not. Then the difference between the loss incurred by the
Bayesian agent and the agent that ignores the information
is 0.5. Thus, in the sense of expected regret, the Bayesian
approach is bound to be at least as good as ignoring the
information in this example. We are currently investigating
whether this is true more generally.

4 PARTIAL IGNORANCE AND
DIFFERENT TYPES OF BOOKIES

In Sections 2 and 3, we showed that ignoring information
is sensible as long as (1) the set $\mathcal{P}$ contains all distributions
on $\mathcal{X} \times \mathcal{Y}$ with the given marginal $\Pr_Y$, and (2) the
loss function $L$ is fixed; in particular, it does not depend on
the realized value of $X$. In this section, we consider what
happens when we drop these assumptions.

The assumption that $\mathcal{P}$ contains all distributions on $\mathcal{X} \times \mathcal{Y}$
with the given marginal $\Pr_Y$ amounts to the assumption
that all the agent knows is $\Pr_Y$. If an agent has more information
about the probability distribution on $\mathcal{X} \times \mathcal{Y}$, then
ignoring information is in general not a reasonable thing
to do. To take a simple example, suppose that the set $\mathcal{P}$
contains only one distribution $P^0$. Then clearly the minimax
optimal strategy is to use the decision rule based on
the conditional distribution $P^0(Y \mid X)$, which means that
all available information is taken into account. Using $\Pr_Y$
is clearly not the right thing to do.

On the other hand, if $\mathcal{P}$ is neither a singleton nor the set
of all distributions with the given marginal $\Pr_Y$, then ignoring
may or may not be minimax optimal, depending on
the details of $\mathcal{P}$. Even in some cases where ignoring is not
minimax optimal, it may still be a reasonable update rule to
use, because, no matter what $\mathcal{P}$ is, ignoring $X$ is a reliable
update rule [Grünwald 2000]. This means the following:
Suppose that the loss function $L$ is known to the agent. Let
$a^*$ be the optimal action resulting from ignoring information
about $X$, that is, adopting the marginal $\Pr_Y$ as the
distribution of $Y$, independently of what $X$ was observed.
Then it must be the case that



$$E_{\Pr_Y}[L_{a^*}] = E_{P}[L_{a^*}(X, Y)] \quad \text{for all } P \in \mathcal{P}, \tag{7}$$

meaning that the loss the agent expects to have using his
adopted action $a^*$ is guaranteed to be identical to the true
expected loss of the agent's action $a^*$. Thus, the quality of
the agent's predictions is exactly as good as the agent thinks
they are, and the agent cannot be overly optimistic about
his own performance. Data will behave as if the agent's
adopted distribution $\Pr_Y$ is correct, even though it is not.

This desirable property of reliability is lost when the loss
function can depend on the observation $X$. To understand
the impact of this possibility, consider again the situation
of Example 2.2, except now assume that the loss function
can depend on the observation.

Example 4.1: As in Example 2.2, assume that
$\mathcal{X} = \mathcal{Y} = \{0, 1\}$, that the agent knows that
$E_{\Pr_Y}[Y] = \Pr_Y(Y = 1) = p$ for some fixed $p$, and let
$\mathcal{P}$ be the set of distributions on
$\mathcal{X} \times \mathcal{Y}$ with marginal $\Pr_Y$. Now the loss
function takes three arguments, where $L(i, j, k)$ is the loss if $i$
is predicted, the true value of $Y$ is $j$, and $X = k$ is observed.
Suppose that $L(i, j, k) = (k + 1)|i - j|$. That is, if the
observation is 0, then, as before, the loss is just the difference
between the predicted value and actual value; on the other hand, if
the observation is 1, then the loss is twice the difference.
Note that, with this loss function, it technically no longer
makes sense to talk about ignoring the information, since
we cannot even talk about the optimal rule with respect to
$\Pr_Y$. However, as we shall see, the optimal action is still to
predict the most likely value according to $\Pr_Y$.

A priori, there are four possible deterministic decision
rules, which have the form "Predict $i$ if 0 is observed
and $j$ if 1 is observed", which we abbreviate as $\delta_{ij}$, for
$i, j \in \{0, 1\}$. It is easy to check that

$$\begin{aligned}
E_{\Pr}[L_{\delta_{00}}] &= \Pr(1,0) + 2\Pr(1,1) = \Pr_Y(1) + \Pr(1,1) \\
E_{\Pr}[L_{\delta_{01}}] &= \Pr(1,0) + 2\Pr(0,1) \\
E_{\Pr}[L_{\delta_{10}}] &= \Pr(0,0) + 2\Pr(1,1) \\
E_{\Pr}[L_{\delta_{11}}] &= \Pr(0,0) + 2\Pr(0,1) = \Pr_Y(0) + \Pr(0,1).
\end{aligned}$$



It is not hard to show that randomization does not help in
this case, and the minimax optimal decision rule is to predict 0
if $\Pr_Y(1) = p < 1/2$ and 1 if $p > 1/2$ (with any
way of randomizing leading to the same loss if $p = 1/2$).
Thus, the minimax-optimal decision rule still chooses the
most likely prediction according to $\Pr_Y$, independent of
the observation.
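
The comparison of the four deterministic rules can be checked numerically. The sketch below is an illustration, not the authors' procedure. It reads $\Pr(j, k)$ as $\Pr(Y = j, X = k)$, parameterizes each joint distribution by the hypothetical quantities `a` $= \Pr(X = 1 \mid Y = 1)$ and `b` $= \Pr(X = 1 \mid Y = 0)$, and uses the fact that the expected loss is affine in `(a, b)`, so the worst case is attained at a corner of the unit square. Only deterministic rules are compared.

```python
from itertools import product

def loss(i, j, k):
    """Loss of Example 4.1: prediction i, true value j of Y, observation k of X."""
    return (k + 1) * abs(i - j)

def extreme_joints(p):
    """Corner points of the set of joints on (y, x) with Pr_Y(1) = p."""
    for a, b in product((0.0, 1.0), repeat=2):
        yield {(1, 1): p * a, (1, 0): p * (1 - a),
               (0, 1): (1 - p) * b, (0, 0): (1 - p) * (1 - b)}

def worst_case_loss(rule, p):
    """Largest expected loss of rule (i, j): predict i on X = 0 and j on X = 1."""
    return max(sum(prob * loss(rule[x], y, x) for (y, x), prob in joint.items())
               for joint in extreme_joints(p))

def best_deterministic_rule(p):
    """Deterministic rule with the smallest worst-case expected loss."""
    rules = list(product((0, 1), repeat=2))
    return min(rules, key=lambda rule: worst_case_loss(rule, p))
```

For example, `best_deterministic_rule(0.3)` returns `(0, 0)`, the rule that always predicts 0, and `best_deterministic_rule(0.7)` returns `(1, 1)`.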

On the other hand, if either 0 or 1 is observed, the same
arguments as before show that the minimax-optimal action
with respect to the conditional probability is to randomize,
predicting both 0 and 1 with probability 1/2. So again, we
have time inconsistency in the sense discussed in Section 2,
and ignoring the information is the right thing to do.

But now consider what happens when the loss function is
$L'(i, j, k) = (|k - j| + 1)|i - j|$. Thus, if the actual value
and the observation are the same, then the loss function is
the difference between the actual value and the prediction;
however, if the actual value and the observation are different,
then the loss is twice the difference. Again we have the
same four decision rules as above, but now we have

$$\begin{aligned}
E_{\Pr}[L'_{\delta_{00}}] &= 2\Pr(1,0) + \Pr(1,1) = \Pr_Y(1) + \Pr(1,0) \\
E_{\Pr}[L'_{\delta_{01}}] &= 2\Pr(1,0) + 2\Pr(0,1) \\
E_{\Pr}[L'_{\delta_{10}}] &= \Pr(0,0) + \Pr(1,1) \\
E_{\Pr}[L'_{\delta_{11}}] &= \Pr(0,0) + 2\Pr(0,1) = \Pr_Y(0) + \Pr(0,1).
\end{aligned}$$

Now the minimax-optimal rule is to predict 0 if $\Pr_Y(1) < 1/3$,
to predict 1 if $\Pr_Y(1) > 2/3$, and to use the randomized
decision rule $\frac{1}{3}\delta_{01} + \frac{2}{3}\delta_{10}$
(which has expected loss 2/3) if $1/3 < \Pr_Y(1) < 2/3$. (In the
case that $\Pr_Y(1) = 1/3$ or $\Pr_Y(1) = 2/3$, the two
recommended rules have the same payoff.)
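
The thresholds 1/3 and 2/3 can be checked with a short sketch. It is an illustration under stated assumptions rather than a proof: it reads $\Pr(j, k)$ as $\Pr(Y = j, X = k)$, represents a randomized rule as a hypothetical dictionary mapping each deterministic rule `(i, j)` to its weight, and evaluates the worst case only at the four corner joints, which suffices because the expected loss is affine in the joint distribution.

```python
from itertools import product

def loss_prime(i, j, k):
    """Loss L': prediction i, true value j of Y, observation k of X."""
    return (abs(k - j) + 1) * abs(i - j)

def worst_case_loss(mixture, p):
    """Largest expected L' of a mixture {rule: weight} when Pr_Y(1) = p.

    A rule (i, j) predicts i on X = 0 and j on X = 1.
    """
    worst = 0.0
    for a, b in product((0.0, 1.0), repeat=2):
        # a = Pr(X=1 | Y=1), b = Pr(X=1 | Y=0); keys of the joint are (y, x).
        joint = {(1, 1): p * a, (1, 0): p * (1 - a),
                 (0, 1): (1 - p) * b, (0, 0): (1 - p) * (1 - b)}
        expected = sum(weight * prob * loss_prime(rule[x], y, x)
                       for rule, weight in mixture.items()
                       for (y, x), prob in joint.items())
        worst = max(worst, expected)
    return worst

mixed = {(0, 1): 1 / 3, (1, 0): 2 / 3}   # (1/3) delta_01 + (2/3) delta_10
always_0 = {(0, 0): 1.0}
always_1 = {(1, 1): 1.0}
```

The mixture has expected loss 2/3 for every joint distribution, whereas always predicting 0 has worst-case loss $2p$, which is below 2/3 exactly when $p < 1/3$.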

On the other hand, if $i$ is observed ($i \in \{0, 1\}$), then the
minimax optimal action is to predict $i$ with probability 1/3
and $1 - i$ with probability 2/3. That is, the optimal strategy
corresponds to the decision rule
$\frac{1}{3}\delta_{01} + \frac{2}{3}\delta_{10}$. Thus, in this
case, there is no time inconsistency if $1/3 < \Pr_Y(1) < 2/3$.¹
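
The local minimax actions for both loss functions of this section can be computed in closed form. The sketch below assumes that, after observing $X = k$, the conditional distribution of $Y$ can be any distribution on $\{0, 1\}$, which is the case when $\mathcal{P}$ contains all joints with marginal $\Pr_Y$ and $0 < p < 1$. If the agent predicts 1 with probability `r`, the conditional expected loss is linear in $\Pr(Y = 1 \mid X = k)$, so the worst case is the larger of the two endpoint losses, and the minimax `r` equalizes them. The function names are hypothetical.

```python
from fractions import Fraction

def loss(i, j, k):
    """First loss of Example 4.1."""
    return (k + 1) * abs(i - j)

def loss_prime(i, j, k):
    """Second loss of Example 4.1."""
    return (abs(k - j) + 1) * abs(i - j)

def local_minimax(loss_fn, k):
    """Return (r, value) after observing X = k.

    r is the probability of predicting 1; value is the worst-case
    conditional expected loss of that randomized prediction.
    """
    cost_wrong_1 = loss_fn(1, 0, k)   # predicted 1 but Y = 0
    cost_wrong_0 = loss_fn(0, 1, k)   # predicted 0 but Y = 1
    # Worst case is max(r * cost_wrong_1, (1 - r) * cost_wrong_0);
    # it is minimized where the two terms are equal.
    r = Fraction(cost_wrong_0, cost_wrong_0 + cost_wrong_1)
    return r, r * cost_wrong_1
```

With `loss_prime`, observing 0 gives `r = 2/3` and observing 1 gives `r = 1/3`, so in both cases the observed value is predicted with probability 1/3. With `loss`, `r = 1/2` for either observation.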

5 RELATED WORK 

In this section we compare our work to recent and closely 
related work by Seidenfeld [2004] and Augustin [2003], as 
well as to various results indicating that information should 
never be ignored. 



Augustin's and Seidenfeld's work. Seidenfeld [2004]
provides an analysis of minimax decision rules which is
closely related to ours, but with markedly different conclusions.
Suppose an agent has to predict the value of a random
variable $Y$ after observing another random variable
$X$. Seidenfeld observes, as we did in Section 2, that the
minimax paradigm can be applied to this situation in two
different ways:

1. In the local minimax strategy, the agent uses the min- 
imax action relative to the set of distributions for Y 
conditioned on the observed value of X. 

2. In the global minimax strategy, the agent adopts the
minimax decision rule (function from observations of
$X$ to actions) relative to the set $\mathcal{P}$ of joint
distributions.

Seidenfeld notes, as we did in Section 2, that the local min- 
imax strategy is not equivalent to the global minimax strat- 
egy. (In his terminology, the extensive form of the decision 
problem is not equivalent to the normal form.) Moreover, 
he exhibits a rather counterintuitive property of local min- 
imax. Suppose that, before observing X, the agent is of- 
fered the following proposition. For an additional small 
cost (loss), she will not be told the value of X before she 
has to predict Y. An agent who uses the local minimax 
strategy would accept that proposition, because not observ- 
ing X leads to a smaller minimax prediction loss than ob- 
serving X. Therefore, a local minimax agent would be 
willing to pay not to get information. This is the same phe- 
nomenon that we observed in Example 2.2. 
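
The size of the fee such an agent would accept can be sketched numerically. The code assumes the setting described in Section 4 for Example 2.2: binary $X$ and $Y$, the loss $|i - j|$ between prediction and outcome, a known marginal $\Pr_Y(1) = p$, and no constraint on the conditional distribution of $Y$ once $X$ is observed. The function names and the grid search are illustrative choices.

```python
def loss_without_observing(p):
    """Minimax expected loss when predicting from the marginal Pr_Y(1) = p alone."""
    return min(p, 1 - p)

def local_minimax_loss(grid=1001):
    """Worst-case conditional expected loss of the local minimax action.

    The agent predicts 1 with probability r; since the conditional of Y is
    unconstrained, the worst case over Pr(Y = 1 | X = x) is max(r, 1 - r).
    """
    best = 1.0
    for step in range(grid):
        r = step / (grid - 1)
        best = min(best, max(r, 1 - r))
    return best

def largest_acceptable_fee(p):
    """Largest extra loss a local minimax agent would accept for not being told X."""
    return local_minimax_loss() - loss_without_observing(p)
```

For `p = 0.2`, the minimax loss without the observation is 0.2 while the local minimax loss after the observation is 0.5, so the agent would accept any fee below 0.3.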

Seidenfeld interprets his observations as evidence that the 
local minimax strategy is flawed, at least to some extent. 
He further views the discrepancy between local and global 
minimax as a problematic aspect of the minimax paradigm. 
In a closely related context, Augustin [2003] also observes 
the discrepancy between the global and the local minimax 
strategy, but, as he writes, "there are sound arguments for 
both". 

In this paper, we express a third point of view: we regard 
both the strategy of ignoring information and the global 
minimax loss strategy as reasonable decision rules, prefer- 
able to, for example, the local minimax loss strategy. How- 
ever, we certainly do not claim that "global minimax loss" 
is the only reasonable strategy. For example, as we ex- 
plained at the end of Section 3, in some situations mini- 
max regret may be more appropriate. Also, as explained in 
Section 4, if $\mathcal{P}$ has a more complex structure than the one
considered in Sections 2 and 3, then ignoring the informa- 
tion may no longer coincide with a global minimax strat- 
egy. It remains to be investigated whether, in such cases, 
there is a clear preference for either ignoring information 
or for global minimax. 


"Cost-free information should never be ignored". As
we observed in Section 3, a purely subjective Bayesian
who is not "pragmatic" in our sense should always condition
on all the available information: information should
never be ignored. This result can be reconciled with our
findings by noting that it depends on the agent representing
her uncertainty with a single distribution. In the Bayesian
case, the agent starts with a set of distributions $\mathcal{P}$,
but this set is then transformed to a single distribution by adopting
a subjective prior on $\mathcal{P}$. The expected value of
information is then calculated using an expectation based on the
agent's prior on $\mathcal{P}$. In contrast, in the "Bayesian"
analysis of Section 3, for reasons explained at the beginning of
Section 3, we computed the expectation relative to all probabilities
in a set $\mathcal{P}$ that is meant to represent the agent's
uncertainty. Consequently, our results differ from the subjective
Bayesian analysis.

6 DISCUSSION

We have shown that, in the minimax sense, sometimes it is
better to ignore information, at least for a while, rather than
updating. This strategy is essentially different from other
popular probability updating mechanisms such as the non-Bayesian
mechanism (local minimax) described in Section 2 and the Bayesian
mechanism of Section 3. The only method we are aware of that leads
to similar results is the following form of the maximum-entropy
formalism: the agent first chooses the unique distribution
$P^* \in \mathcal{P}$ that maximizes the Shannon entropy, and then
predicts $Y$ based on the conditional distribution
$P^*(Y = \cdot \mid X = x)$ [Cover and Thomas 1991]. Such an
application of the Maximum Entropy Principle will ignore the value of
$X$ if $\mathcal{P}$ contains all distributions with the given
marginal $\Pr_Y$. However, as we indicated in Section 4, updating by
ignoring can still be useful if $\mathcal{P}$ contains only a subset
of the distributions with given $\Pr_Y$. Yet in such cases, it is
well known that the maximum entropy $P^*$ may introduce
counterintuitive dependencies between $X$ and $Y$, as exemplified by
the Judy Benjamin problems [Grove and Halpern 1997], thereby making
the method different from merely ignoring $X$. Our minimax-optimality
results depend on the assumption that the set of possible prior
distributions contains no information about the possible correlations
between the variable of interest and the observed variable. In
addition, they depend on the assumption that the payoff depends
only on the actual value and the predicted value of
the variable of interest.
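
The claim that maximum entropy ignores $X$ when $\mathcal{P}$ contains all joints with the given marginal can be illustrated for binary $X$ and $Y$. The sketch below is an assumption-laden illustration: it parameterizes joints by the hypothetical quantities `a` $= \Pr(X = 1 \mid Y = 1)$ and `b` $= \Pr(X = 1 \mid Y = 0)$ and replaces a proper optimizer with a coarse grid search.

```python
from math import log

def entropy(joint):
    """Shannon entropy (in nats) of a joint distribution given as a dict."""
    return -sum(q * log(q) for q in joint.values() if q > 0)

def joint_from(p, a, b):
    """Joint on (y, x) with Pr_Y(1) = p, a = Pr(X=1 | Y=1), b = Pr(X=1 | Y=0)."""
    return {(1, 1): p * a, (1, 0): p * (1 - a),
            (0, 1): (1 - p) * b, (0, 0): (1 - p) * (1 - b)}

def max_entropy_joint(p, grid=101):
    """Grid search for the entropy maximizer among joints with marginal p."""
    candidates = (joint_from(p, i / (grid - 1), j / (grid - 1))
                  for i in range(grid) for j in range(grid))
    return max(candidates, key=entropy)

def conditional_y1(joint, x):
    """Pr(Y = 1 | X = x) under the given joint."""
    return joint[(1, x)] / (joint[(1, x)] + joint[(0, x)])

p_star = max_entropy_joint(0.3)
```

For `p = 0.3`, the maximizer makes $X$ uniform and independent of $Y$, so `conditional_y1(p_star, 0)` and `conditional_y1(p_star, 1)` both equal the marginal 0.3 up to rounding.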

One way of understanding the issues involved here is in 
terms of knowledge, as advocated by Halpern and Tuttle 
[1993], specifically, the knowledge of the agent and the 
knowledge of the "adversary" who is choosing the loss 
function. The knowledge of the agent is encoded by the set 
of possible prior distributions. The knowledge of the adversary
is encoded in our assumptions on the loss function. If
the adversary does not know the observation at the time that
the loss function is determined, then the loss function can- 
not depend on the observation; if the adversary knows the 
observation, then it can. More generally, especially if neg- 
ative losses (i.e., gains) are allowed, and the adversary can 
know the true distribution, then the adversary can choose 
whether to allow the agent to play at all, depending on the 
observation. In future work, we plan to consider the impact 
of allowing the adversary this extra degree of freedom. 